npj Digital Medicine

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match npj Digital Medicine's content profile, based on 97 papers previously published here. The average preprint has a 0.23% match score for this journal, so anything above that is already an above-average fit.

1
Wearables Anticipate Postoperative Complications: A Prospective Cohort Study

Lederer, L.; Roghanizad, A. R.; Howell, T. C.; Turnage, K.; Blazer, D. G.; Knackstedt, R.; Hwang, S.; Dunn, J.

2026-06-03 surgery 10.64898/2026.06.02.26354556 medRxiv
Top 0.1%
73.1%

Consumer wearable devices enable continuous passive physiologic monitoring in free-living conditions, yet their capacity to detect early postoperative deterioration following hospital discharge remains poorly characterized. Here we report a prospective observational cohort study evaluating multimodal wearable-derived physiologic signals across the perioperative period in adults undergoing elective oncologic surgery at Duke University Health System. Participants were monitored using an Oura Ring Gen 2 and Garmin Vivosmart 4 from at least two weeks preoperatively through up to 90 days postoperatively, alongside daily electronic patient-reported pain surveys. Devices captured 3,705 participant-days and 82,833 hours of physiologic data across 46 surgical patients. Oura adherence averaged 21.0 hours/day and was significantly higher than Garmin throughout the study period (17.6 hours/day). Garmin wear time declined significantly following surgery, while Oura adherence remained comparatively stable. Postoperative complications occurred in 17 participants (37%), including 10 major complications (Clavien-Dindo grade IIIb or higher) with a median onset of 13 days after surgery. Patients with major complications demonstrated significantly greater peak deviations from baseline in the first 10 postoperative days across resting heart rate, sleep temperature deviation, and readiness metrics. In the days before clinically documented major complications, wearable and patient-reported signals diverged from those of participants without major complications, with reduced activity appearing as early as four days before the event, followed by higher reported pain and later elevations in resting heart rate and sleep temperature deviation. 
These findings support the feasibility of prolonged perioperative wearable monitoring and suggest that physiologic deterioration preceding major surgical complications may be detectable days before clinical documentation, motivating further development and validation of wearable-based postoperative surveillance strategies.

2
Exploring the Interpretability of AI Decision Support Systems for Surgical Anatomy Recognition

Khan, D. Z.; Adams, T.; Wijekoon, A.; Ramirez Herrera, R.; Bano, S.; McCulloch, P.; Stoyanov, D.; Clarkson, M. J.; Costanza, E.; Blandford, A.; Marcus, H.; CARES Evaluation Group

2026-06-03 surgery 10.64898/2026.06.02.26354729 medRxiv
Top 0.1%
53.3%

Artificial intelligence (AI) decision support systems for surgery hold promise but face barriers to adoption, particularly around the interpretability of their outputs. We conducted an international cross-sectional survey of 47 neurosurgeons to evaluate perspectives on literature-derived explanation techniques for AI-generated anatomical segmentations, using endoscopic pituitary surgery as a high-risk exemplar. Participants ranked certainty scores, certainty maps, saliency maps, scene similarity scores, and nearest-neighbour illustrations, and rated them using a modified Explanation Satisfaction Scale alongside free-text feedback. Certainty-based techniques were consistently ranked and rated highest for interpretability, and were valued for aligning with surgical decision-making by conveying confidence (via scores) and anatomical boundaries (via maps). Saliency- and similarity-based methods were judged less clinically relevant and better suited to educational settings. Certainty-based explanations therefore appear most acceptable to surgeons for clinical integration of decision support systems, though their impact on AI acceptability, trust calibration, and performance requires prospective evaluation across surgical domains.

3
Evaluating Sycophancy in Frontier Models Using Persona-Driven Challenge

Hazare, N. S.; Goel, N.; Yu, C.; Agaron, S.; Sharma, A.; Parchure, P.; Patel, D.; Timsina, P.; Kaplan, B.; Lampert, J.; Vakil, A.; Kovatch, P.; Darrow, B.; Glicksberg, B. S.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-20 health informatics 10.64898/2026.05.17.26353406 medRxiv
Top 0.1%
53.0%

Large language models (LLMs) are increasingly used for lay health queries, yet may abandon correct recommendations under pressure, a vulnerability termed sycophancy. We evaluated sycophancy across five frontier LLMs (Claude Opus 4.6, Claude Sonnet 4.6, GPT 5.4, Grok 4.1, Gemini 3 Flash) using 200 synthetic clinical vignettes, each anchored to a unanimous correct treatment baseline and challenged by nine personas representing both vulnerable and authority roles. Overall, 7.1% of responses were sycophantic, varying tenfold across personas (1.7 to 19.3%) and sixfold across LLMs (2.4 to 15.3%). Vulnerable personas elicited more sycophantic responses, with the medical student persona eliciting the highest rate (19.3%). In adjusted generalized estimating equation (GEE) models, persona and LLM were both independent predictors of sycophantic responses, and vulnerable personas remained independent predictors, a reversal of the expected authority gradient. Persona-driven sycophancy evaluation should be integrated into pre-deployment safety assessment of clinical LLMs.

4
Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Yan, J.; Machlanski, D.; Butler, K.; Dimitrakopoulos, P.; Harrison, E. M.; Guthrie, B. M.; Tsaftaris, S. A.

2026-05-24 health informatics 10.64898/2026.05.21.26353781 medRxiv
Top 0.1%
52.2%

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, adding non-linear terms for two features, and adding 221 feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and calibration also improved (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve both the development process and the performance of high-dimensional transparent predictive models.

5
The Multimodal Anonymizer: a fully local multi-agent AI system for medical data deidentification

Hirsch, A.; Ten, F. W.; Krueger, K. S.; Geyer, R.; Roeschl, T.; Groeschel, M.; Rostin, P.; Eils, R.; Spott, M.; Prasser, F.; Meyer, A.; Madrid, J.

2026-06-05 health informatics 10.64898/2026.05.28.26353952 medRxiv
Top 0.1%
51.4%

Background: Safe reuse of multimodal hospital data for AI development is limited by the absence of reliable, context-aware deidentification across multimodal data and longitudinal patient data. Existing approaches are largely modality-specific and can indiscriminately remove clinically important information. Methods: We developed the Multimodal Anonymizer, a modular, locally deployable multi-agent framework integrating multimodal large language models, task-specific neural networks and rule-based transformations. We evaluated 16 orchestrator model configurations on a benchmark built from publicly available data and hospital data from our institution. The benchmark dataset included data from different origins: 250 MIMIC-IV patients with synthetically injected personally identifiable information (PII) supplemented with head CT, face images, handwriting, audio, German clinical-text datasets and local data. Primary outcomes were deidentification sensitivity and preservation of clinically important content; secondary analyses examined model characteristics, reproducibility, and performance against leading market and open-source solutions. Results: The best local configuration (the orchestrator being Qwen3-VL-235B-A22B-Thinking) achieved near-complete deidentification across all datasets, with per-patient sensitivity of 98.80% (95%-CI 97.20; 100), and per-PII sensitivity of 99.82% (95%-CI 99.76; 99.88). Critical clinical preservation was 99.60% (95%-CI 98.80; 100) per-patient, and clinical preservation was 99.61% (95%-CI 99.51; 99.71) per-file. All modalities achieved at least 98.30% sensitivity (lower bound 95%-CI). On our local data, the system achieved a deidentification sensitivity of 100% per-patient and per-PII; and a critical clinical preservation of 100% per-patient as well as a clinical preservation of 99.97% (95%-CI 99.91; 100) per-file. 
When comparing orchestrators, the leading local models were similar to proprietary models (GPT-5.2) in deidentification sensitivity while showing higher deidentification specificity. The Multimodal Anonymizer outperformed previous tools on most modalities. Conclusion: Near-complete, utility-preserving deidentification of multimodal clinical data is achievable with a unified, locally deployable multi-agent system, enabling safer large-scale reuse of hospital data for research and AI development.

6
An Explainable Multimodal AI Framework with Reinforcement Learning for Post-Surgical Clinical Decision Support

Ahmed, M.; Ahmed, F.; Mow, S. M.; Taha, P. A.; Barua, S.; Rahman, M. M.; Rafy, A.; Mondol, S. M.; Faisal, M. I.

2026-06-10 health informatics 10.64898/2026.06.08.26355217 medRxiv
Top 0.1%
48.1%

Post-surgical adverse outcomes, including mortality, intensive care readmission, and complications, remain major challenges for clinical decision-making. Existing machine learning approaches focus on outcome prediction while operating as opaque systems, limiting clinical trust and the translation of predictions into treatment decisions, and many clinical studies rely on synthetic data in which shared intermediate variables create circular dependencies between inputs and targets that compromise reported performance. We aimed to develop an explainable multimodal architecture and a rigorous evaluation methodology that address these gaps. We designed a two-stage architecture integrating supervised deep learning for risk prediction with conservative Q-learning for action recommendation. The first stage uses five modality-specific encoders for structured records, physiological time-series, chest radiographs, clinical notes, and surgical metadata, unified through cross-modal attention into a shared patient-state representation. The second stage applies offline reinforcement learning to recommend clinical actions while preventing value overestimation. We formally characterized a target-leakage flaw in synthetic pipelines and propose a real-data methodology using a verified clinical database, with event-censored temporal separation and uncertainty-weighted per-task training. Component-level behavior was validated on a controlled synthetic benchmark, demonstrating that the architecture functions as designed without claiming clinical validity. The cross-modal attention and risk-prediction components behaved as expected, whereas the offline reinforcement learning stage did not converge on the benchmark, indicating that value estimation requires further investigation on real clinical data. 
The architecture provides dual-level explainability through attention visualization and value decomposition, contributing a deployable design, a formal methodological critique of synthetic-data practices, and a complete framework for clinically valid evaluation.

7
Preliminary Reliability and Validity of SynapTrack, a Smartphone-Based Digital Biomarker Platform for Remote Assessment of Cervical Spondylotic Myelopathy

Yakdan, S.; Singh, P.; Arkam, F.; Chen, E.; Lewis, A.; Steel, B.; Becker, I.; Guo, W.; Naveed, H.; Wang, C.; Yang, D.; Wang, Z.; Ray, W. Z.; Hassenstab, J.; Steinmetz, M. P.; Ghogawala, Z.; Kelleher, C.; Greenberg, J.

2026-06-01 surgery 10.64898/2026.05.29.26354454 medRxiv
Top 0.1%
44.9%

Background and Objectives: Cervical spondylotic myelopathy (CSM) is a leading cause of neurological disability in older adults. However, validated, scalable tools to quantify disease severity and changes over time are lacking. Recent advances in smartphone technology have opened new avenues for longitudinal, objective, and remote monitoring of neurological conditions. We performed a preliminary evaluation of the reliability and validity of SynapTrack, a smartphone-based digital platform for objective remote CSM assessments. Methods: In this single-center prospective cohort study, 265 participants (151 with CSM, 114 healthy controls) completed in-person SynapTrack assessments related to tapping, pinching, and vibratory detection, along with reference laboratory measures of dexterity (Box and Block Test, 9-Hole Peg Test) and vibratory sensation (tuning fork). A subset completed repeated home-based testing to assess test-retest reliability. We evaluated convergent validity, construct validity against the modified Japanese Orthopedic Association (mJOA) score, known-groups validity, and test-retest reliability (intraclass correlation coefficient, ICC). Results: Smartphone-derived metrics demonstrated good-to-excellent test-retest reliability, with the strongest stability for vibratory detection threshold (ICC = 0.92), overall and non-dominant tapping speed (ICC = 0.90 each), and pinching successful targets (ICC = 0.90). Convergent validity was supported by moderate-to-strong correlations between digital metrics and reference laboratory dexterity tests (ρ up to 0.60 for tapping speed; up to -0.65 for the vibratory threshold). Construct validity against the mJOA was strongest for the vibratory threshold (ρ = -0.53 to -0.54) and Level 2 non-dominant pinching errors (ρ = -0.45). 
Selected metrics distinguished CSM patients from controls with good discrimination, including non-dominant tapping speed (AUROC = 0.76, 95% CI 0.68-0.85), Level 2 dominant pinching successful targets (AUROC = 0.78, 95% CI 0.62-0.94), and the non-dominant vibratory threshold (AUROC = 0.77, 95% CI 0.64-0.90). Conclusions and Relevance: A smartphone-based battery of upper-extremity sensorimotor tasks demonstrated preliminary reliability and validity in CSM. Furthermore, to our knowledge, the novel vibratory detection task represents the first smartphone-based sensory assessment used for CSM. Collectively, these findings position SynapTrack as a scalable platform for objective, remote neurological monitoring of CSM.

8
An interpretable and interactive clinical AI agent for personalized anti-infective decision support in carbapenem-resistant Gram-negative bacterial infection

Cao, X.; Shi, D.; Du, Z.; Zhou, J.; Wang, Z.; Liu, Z.; Wang, Q.

2026-05-19 health informatics 10.64898/2026.05.18.26353005 medRxiv
Top 0.1%
39.4%

Carbapenem-resistant Gram-negative bacteria (CRGNB) infections remain difficult to manage because treatment decisions must balance heterogeneous patient risk, limited antibiotic options, potential toxicity and emerging resistance. Clinical care in this setting requires not only single-endpoint risk prediction, but also decision-support frameworks that can jointly enable prognosis assessment, result interpretation, and individualized treatment comparison. Here we present Dr.BUG, an interactive clinical AI agent for personalized decision support in CRGNB infection. Dr.BUG integrates stable feature-set selection, multi-task prognostic modelling, interpretability analysis and model-based simulation of antibiotic regimen recommendation into a unified workflow. Using a development cohort, a temporally independent validation cohort, and external cohorts from the MIMIC-IV dataset, we developed and validated models for four clinically relevant tasks: clinical efficacy, survival outcome, polymyxin resistance and treatment duration. Model inputs were derived primarily from routinely available and relatively low-cost clinical variables, supporting translational feasibility. Across the major tasks, selected-feature models matched or exceeded the performance of their full-feature counterparts while using fewer variables, as reflected in 82.0% of optimized-metric comparisons in the development cohort, and remained robust in both temporal and external validation. Dr.BUG further provided both population-level and patient-level interpretability and generated individualized rankings of candidate antibiotic regimens. In the retrospective analysis of non-survivors, clinician review suggested that regimens recommended by Dr.BUG might be associated with higher predicted survival probabilities. 
These findings support a broader role for clinical AI in complex drug-resistant infections, extending its utility from offline risk prediction to interpretable, deployable, and personalized decision support.

9
Three Decades of FDA Authorizations of AI/ML Enabled Medical Devices: Persistent Specialty Concentration and the Care Delivery Gap (1995 to 2025)

Golshani, P.; Joseph, M. S.

2026-05-12 health informatics 10.64898/2026.05.08.26352766 medRxiv
Top 0.1%
38.0%

The US Food and Drug Administration (FDA) maintains a public list of artificial intelligence and machine learning (AI/ML)-enabled medical devices that have received marketing authorization. Prior published analyses examined this list at earlier time points and reported a marked dominance of radiology applications. We performed a cross-sectional analysis of all 1,430 AI/ML-enabled medical device authorizations recorded by the FDA between September 1995 and December 2025 to characterize the cumulative growth, specialty distribution, and manufacturer concentration of authorized devices. The annual authorization volume increased from a mean of 1.8 per year between 1995 and 2014 to 264 per year between 2023 and 2025, with 331 authorizations recorded in 2025 alone. Devices reviewed by the FDA's Radiology panel accounted for 1,094 of 1,430 authorizations (76.5%), and the three most represented panels (Radiology, Cardiovascular, and Neurology) accounted for 90.6% of all authorizations. Several large clinical specialties were represented by very small numbers of authorized devices, including Pathology (n = 9), Microbiology (n = 6), and Obstetrics and Gynecology (n = 4). No authorizations were recorded under a psychiatry or behavioral health review panel. Of 740 unique companies, 502 (67.8%) had a single authorized device, while 13 companies (1.8%) accounted for 217 devices (15.2%). The cumulative regulatory record demonstrates rapid growth that has been concentrated in image-rich diagnostic specialties, with limited representation across many specialties that account for substantial clinical activity in the United States. These findings may inform policy discussions about where regulatory, infrastructure, and dataset investments are most needed to broaden the clinical scope of medical AI.
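
As a quick arithmetic check, the concentration percentages quoted in the abstract above can be recomputed from the counts it reports. The sketch below is only illustrative: the constant names and the `pct` helper are assumptions introduced here, not part of the preprint's analysis.

```python
# Counts copied from the abstract; all names are illustrative assumptions.
TOTAL_AUTHORIZATIONS = 1430
RADIOLOGY_AUTHORIZATIONS = 1094
TOTAL_COMPANIES = 740
SINGLE_DEVICE_COMPANIES = 502
TOP_COMPANIES = 13
TOP_COMPANY_DEVICES = 217


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "Radiology share of authorizations": pct(RADIOLOGY_AUTHORIZATIONS, TOTAL_AUTHORIZATIONS),
    "Companies with a single device": pct(SINGLE_DEVICE_COMPANIES, TOTAL_COMPANIES),
    "Top companies as share of companies": pct(TOP_COMPANIES, TOTAL_COMPANIES),
    "Top companies' share of devices": pct(TOP_COMPANY_DEVICES, TOTAL_AUTHORIZATIONS),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

Under these inputs the recomputed shares agree with the figures stated in the abstract (76.5%, 67.8%, 1.8%, and 15.2%).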

10
Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges: A Randomized Controlled Trial

Qazi, I. A.; Ali, A.; Khawaja, A. U.; Akhtar, M. J.; Sheikh, A. Z.; Alizai, M. H.

2026-06-02 health informatics 10.64898/2026.06.01.26354596 medRxiv
Top 0.1%
37.4%

As large language models (LLMs) enter clinical workflows, automation bias, the uncritical acceptance of automated output, poses a patient-safety risk. Optimal physician-AI collaboration requires trust calibration, matching scrutiny to LLM recommendation accuracy. We report a randomized trial evaluating a behavioral nudge to mitigate automation bias. Seventy-two AI-trained physicians were randomized to evaluate six vignettes alongside ChatGPT-5.1 recommendations, consulted at each physician's discretion; three contained deliberate, clinically significant errors. The treatment arm received a dual-component nudge: an anchoring cue reporting ChatGPT's benchmark accuracy to calibrate expectations, and a case-specific, selective-attention cue consisting of a numeric accuracy rating and color-coded traffic light, derived from the mean of three distinct-family LLMs. The control group saw recommendations alone; blinded reviewers scored diagnostic reasoning against an expert rubric. The treatment group scored significantly higher (mean difference, 7.6 percentage points; 95% CI, 1.4-13.9; P=0.016) than the control group, suggesting a scalable strategy to preserve clinical judgment in LLM-assisted care. ClinicalTrials.gov registration: NCT07328815.

11
Optimising the Usability of AI Driven Augmented Reality Displays of Critical Structures During Surgery - An International Study of Surgeon-Computer Interaction

Ramirez Herrera, R.; Khan, D. Z.; Wijekoon, A.; Bano, S.; Clarkson, M. J.; Marcus, H.; Blandford, A.; CARES Evaluation Group

2026-06-03 surgery 10.64898/2026.06.02.26354758 medRxiv
Top 0.1%
34.9%

In many endoscopic surgical procedures, the surgical team must identify and remove pathological tissue while avoiding critical structures such as arteries and nerves. Augmented reality (AR) offers potential support by overlaying visual information about the location of pathology and critical structures directly onto the operative field, enhancing spatial awareness and surgical navigation. However, limited research has evaluated how best to design and present AR overlays in ways that align with surgical workflow and perception. This study investigates surgeons' preferences across three key AR overlay dimensions: Design (how anatomy is visualised: outlines, heatmaps, masks, or centroids), Trigger (how and when overlays are activated: always visible, activated by the user, or triggered by instrument position), and Placement (where the overlay appears: above or below the surgical instrument). We take endoscopic pituitary adenoma surgery as a high-risk exemplar. Using a web-based prototype, 38 neurosurgeons ranked options and provided qualitative feedback. Surgeons preferred outline designs for clarity, user-activated triggers for control of information flow and distraction minimisation, and below-instrument placement for better spatial awareness. Preferences were consistent across experience levels and emphasised the importance of balancing visual saliency with cognitive load, to facilitate surgical navigation without distraction or disruption. These findings inform AR interface design, but require evaluation for impact on surgical performance and safety in further physical simulation and clinical studies.

12
Calibrating trust in AI-assisted pituitary surgery

Hudson, G. R.; Khan, D. Z.; Fayez, F.; Bhatia, S.; Bano, S.; Costanza, E.; Blandford, A.; Stoyanov, D.; McCulloch, P.; Marcus, H. J.; University College London Collaborators

2026-06-04 surgery 10.64898/2026.06.02.26354735 medRxiv
Top 0.1%
34.9%

Background: Endoscopic endonasal transsphenoidal surgery (EETS) requires navigation around neurocritical anatomy. Today, artificial intelligence clinical decision support systems (AI-CDSSs) can orientate surgeons, but clinician trust in AI remains unclear, limiting safe deployment. This study evaluates how modifiable design affects trust and performance in a real-world pituitary surgery AI-CDSS. Method: Online, 70 clinicians with pituitary surgery experience were randomised evenly to a Basic or Enhanced AI-CDSS that outlines the sella on EETS operative video. The Enhanced group additionally received an explanation of the model and previous publications, alongside confidence labels depicting outline reliability. Both groups annotated the sella on six video clips, first alone then with the optional AI-CDSS. Clips were ordered by declining AI performance, except for the final clip. Self-reported trust was measured using a 1-7 scale after each annotation, and performance was the Dice overlap between user annotations and the ground truth. Comparisons used Mann-Whitney U and permutation analysis. Results: Sixty-four participants (91%) finished the exercise (31 Basic, 33 Enhanced). When AI performed best, median trust was 5.00 in both arms (U=559, p=.521). However, when AI performed worst, trust was significantly lower for the Enhanced group (3.00 vs 3.67, U=668, p=.035), sustained in the final clip (3.67 vs 4.33, U=687, p=.019). User performance improved with the AI-CDSS, but with no significant difference between the groups on the best or worst AI performing clips. Nevertheless, for the best AI, senior clinicians had higher median performance in the Enhanced group (0.95 vs 0.90, U=75, p=.066). There was also less dispersion in the Enhanced group when AI was inaccurate (IQR: 0.07 vs 0.21, p=.004). 
Conclusion: Interface design can improve trust calibration in a surgical AI-CDSS and may improve performance among senior clinicians when the AI is accurate, and consistency when the AI is inaccurate. In future, these features may form important safety checks during translation to the operating room.
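
The Dice overlap used as the performance measure in the abstract above is conventionally defined as 2|A∩B| / (|A| + |B|) for two binary masks A and B. The following is a minimal sketch of that standard definition, not the authors' implementation; the function name `dice`, the flat-list mask representation, and the choice to return 1.0 when both masks are empty are all assumptions made for illustration.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two equal-length binary masks.

    Masks are flat sequences of truthy/falsy pixel values (an assumed
    representation; real annotations would typically be 2D arrays).
    """
    a = [bool(x) for x in mask_a]
    b = [bool(x) for x in mask_b]
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")

    overlap = sum(x and y for x, y in zip(a, b))  # |A ∩ B|
    total = sum(a) + sum(b)                       # |A| + |B|

    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2 * overlap / total


# Hypothetical 4-pixel example: one shared foreground pixel out of two each.
user_annotation = [1, 1, 0, 0]
ground_truth = [1, 0, 1, 0]
print(dice(user_annotation, ground_truth))
```

A score of 1.0 means the annotation and ground truth coincide exactly, and 0.0 means they share no foreground pixels.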

13
Agentic Chart Review from Longitudinal Clinical Notes: a Lung Cancer Guideline Concordance Use Case

Jiang, Y.; He, X.; Ai, X.; Jalal, S.; Maniar, R.; Majji, R. K.; Zhang, Y.; Liu, J.; Fedele, D.; Zhuang, Y.; Hollenbach, J.; Bian, J.

2026-06-03 oncology 10.64898/2026.06.02.26354727 medRxiv
Top 0.1%
34.5%

Clinical chart abstraction extracts structured patient variables from longitudinal clinical notes but is labor-intensive and difficult to scale. We evaluated LLM agents for question-guided chart review using lung cancer molecular testing guideline concordance as a use case. Two configurations were compared: (1) sequential note review using metadata and chronology, and (2) the same framework augmented with keyword-based note search. Gold-standard labels were established by human annotators. The search-enabled agent achieved higher accuracy (92.4% vs. 83.5%) and reduced errors by more than half (41 vs. 89) by retrieving evidence from long, heterogeneous note histories. In guideline concordance evaluation, most determinate patient-rule assessments were concordant (80.7%), while most apparent non-concordance reflected missing molecular testing documentation rather than documented care deviations. These results suggest tool-augmented LLM agents can approximate key aspects of human chart review and support scalable information extraction from longitudinal clinical documentation.

14
Asymmetry between warmth and clinical substance in multilingual consumer health AI

Ariel, D.; Grumberg, L. R.; Supakul, S.; Wannasri, S.; Mitchnik, I. Y.; Lev, A.; Ariyamethanon, W.; Agbarieh, M.; Miari, S.; Laban, G.; Hasid, B.

2026-05-14 health informatics 10.64898/2026.05.09.26352813 medRxiv
Top 0.1%
33.9%

The same patient question can yield different clinical quality across languages. Across 504 forum-derived patient queries in six languages and four chatbots, language-matched clinicians rated responses on five clinical dimensions (1,008 ratings; 5,040 dimension scores). Patient language outweighed chatbot identity across the four clinical-substance dimensions (composite language partial η² 0.275 vs chatbot 0.035; robust to investigator-rating exclusion: η² 0.260) but not for empathy (η² 0.029): clinical substance was language-associated; warmth was relatively preserved. Catastrophic safety ratings ranged 4.3-fold by language (3.6% English, 15.5% Thai and Hebrew); 62% of catastrophic ratings exceeded the English baseline (descriptive disparity). Failures were systematic and silent: none of 24 stroke responses conveyed time-criticality framing, none of 24 CO-poisoning responses challenged the family's stress framing, and 120 sentinel responses contained no confident errors. Warmth did not discriminate clinical danger (response-level empathy AUC = 0.49): consumer health AI can deliver fluent, caring tone with degraded clinical substance.

15
Medical discrimination and the selective erosion of institutional health trust: evidence from the Health Information National Trends Survey 6 and 7

Park, A.; Yin, L.; Wong, A.; Lee, C.; Choi, Y.

2026-06-09 public and global health 10.64898/2026.06.06.26355057 medRxiv
Top 0.1%
33.9%

Medical discrimination may alter how patients relate to health information sources following adverse care encounters. We examined whether discrimination experience is associated with selective erosion of institutional health trust and with compensatory digital health engagement, using nationally representative data from the Health Information National Trends Survey (HINTS) 6 (2022; n=6,252) and HINTS 7 (2024; n=7,278). Survey-weighted modified Poisson regression estimated prevalence ratios (PRs) for binary high-trust outcomes, and survey-weighted ordinary least squares estimated coefficients for continuous outcomes; jackknife replicate weights (50 replicates) provided variance estimates. Discrimination was associated with substantially lower probability of high trust in the healthcare system (PR=0.39; 95% CI 0.30-0.52) and physicians (PR=0.85; 95% CI 0.77-0.94), with no significant association for trust in scientists, government, family, or religious organisations. The clinical-institutional pattern replicated in HINTS 6, which additionally showed reduced trust in scientists for race/ethnicity-based discrimination. Contrary to a disengagement hypothesis, discrimination-exposed adults showed higher probability of online health information seeking (PR=1.06), health app use (PR=1.11), and online provider messaging (PR=1.13); these associations persisted after adjustment for trust in physicians. Discrimination was independently associated with lower health self-efficacy (b=-0.271). Medical discrimination selectively erodes trust in clinical institutions while leaving broader epistemic trust largely intact. Despite this, discrimination-exposed patients engage more actively with digital health channels, consistent with compensatory reorientation toward non-clinical information sources. These findings describe engaged but institutionally alienated patients, with implications for restoring clinical trust and for equity-centred digital health design.

16
Digital biomarkers for insulin resistance screening in daily life

Jovanova, M.; Bruegger, V.; Svirhrova, R.; Fuchs, M.; Jin, Q.; Wortmann, F.; Mitter, M.; Bechny, M.

2026-05-22 health informatics 10.64898/2026.05.20.26353669 medRxiv
Top 0.1%
33.7%

One in four adults has insulin resistance (IR), a modifiable driver of type-2 diabetes that can precede diagnosis by a decade. However, IR assessment remains clinic- and laboratory-based, limiting repeated population screening. We tested whether free-living wearable data can detect IR in adults with normoglycemia or prediabetes. Machine-learning models using continuous glucose monitor (CGM)-based glucose dynamics and smartwatch-based heart rate/heart rate variability were developed in Study 1 (N = 97) and externally validated without retraining in Study 2 (N = 61, 31% IR prevalence). The best-performing CGM-based model achieved AU-ROC = 0.873 [0.756-0.967] and AU-PRC = 0.816 [0.640-0.934], outperforming an anthropometrics-only baseline (AU-ROC = 0.749, AU-PRC = 0.593). These findings are the first demonstration of IR detection from wearables without blood tests or structured glucose challenges, with performance comparable to the state of the art. By enabling continuous at-home screening, this approach can identify undetected at-risk individuals and trigger confirmatory blood tests to close detection gaps.

17
An AI-Powered Smartphone Application for Universal and Standardized Reading and Interpretation of Lateral Flow Assays

Bermejo-Pelaez, D.; Darias, O.; Pastor, L.; Valles, R.; Diez, N.; Lin, L.; Garcia-Villena, J.; Cuadrado, D.; Vladimirov, A.; Alamo, E.; Postigo, M.; Rodriguez-Dominguez, M.; Canton, R.; Rodriguez-Tudela, J. L.; Alastruey Izquierdo, A.; Bohorquez, L. C.; Rubio, J. M.; Dacal, E.; Luengo-Oroz, M.

2026-05-18 public and global health 10.64898/2026.05.14.26352875 medRxiv
Top 0.1%
33.7%

Introduction. Lateral flow assays (LFAs) are indispensable rapid diagnostic tools in healthcare, enabling point-of-care diagnosis critical for patient management and supporting disease burden assessment and surveillance when results are properly recorded. However, misinterpretation errors and unreported cases remain a concern. A quality-assured, affordable AI-powered tool that supports decision-making during result interpretation could promote proper disease monitoring and epidemiological surveillance. Here, we describe the performance of a universal AI model to digitize and interpret results from multiple LFA types through a smartphone application, a step that could ultimately enable standardized and digitally reportable test outcomes. Methods. The AI algorithm was evaluated in 17 LFA types, including both 2-band and 3-band tests for different diseases and manufacturers. The model was trained on a dataset of 22,576 images captured under diverse lighting conditions with different smartphone models using a custom mobile application, TiraSpot (Spotlab, Madrid, Spain). To assess generalizability, a leave-one-out cross-validation was applied, wherein each LFA type was iteratively excluded from training and used for testing. Model performance was evaluated using bootstrapping on the inference dataset. Results. In the assessment of the model's ability to generalize to new LFA types not previously analyzed (not included during development), the model achieved an overall AUC of 94.3% for second band detection. This overall performance was enhanced to 99.3% (Sensitivity=98.6%; Specificity=98%) after training with 50 images of each LFA type, highlighting the benefit of additional data for specific LFA types. For the third band detection, where less training data was available, the system achieved an overall AUC of 83.9% for unseen LFAs, improving to 94.2% (Sensitivity=92.9%; Specificity=87.9%) after training with 50 images of each LFA type. Conclusion. This system demonstrates the feasibility of an AI-powered universal digital reader for interpreting LFA results from diverse test types using smartphone-captured images. Its compatibility with standard smartphones makes it a universal tool, enabling reliable LFA interpretation across devices and settings. By standardizing test interpretation and digitizing results, this tool could support decision making in result interpretation, enhancing epidemiological surveillance, particularly in resource-limited settings. Its adaptability across various infections highlights its potential to improve diagnostic consistency and support disease management in diverse healthcare settings.
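
The leave-one-out scheme described in the Methods, in which each LFA type is held out in turn, can be illustrated with a minimal sketch. This is not the authors' code: the function name `leave_one_type_out` and the `(lfa_type, image_id)` sample format are hypothetical, chosen only to show how such splits could be generated.

```python
from collections import defaultdict


def leave_one_type_out(samples):
    """Yield (held_out_type, train_ids, test_ids), one split per LFA type.

    `samples` is a list of (lfa_type, image_id) pairs. Both fields are
    hypothetical placeholders, not the study's actual data format.
    """
    # Group image ids by their LFA type, preserving insertion order.
    by_type = defaultdict(list)
    for lfa_type, image_id in samples:
        by_type[lfa_type].append(image_id)

    # Hold out each type in turn; train on all remaining types.
    for held_out in sorted(by_type):
        train = [img for t, imgs in by_type.items() if t != held_out for img in imgs]
        test = list(by_type[held_out])
        yield held_out, train, test


samples = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("C", "c1")]
splits = list(leave_one_type_out(samples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Under this scheme no image of the held-out type is ever seen during training, which is what lets the test score be read as performance on an unseen LFA type.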

18
A Data-Driven Framework for Generating Population-Linked Case Vignettes from Nationwide Triage Data

Seidel, A.; Steiger, E.; Schuster, J.; Kroll, L. E.

2026-06-10 health informatics 10.64898/2026.06.08.26354886 medRxiv
Top 0.1%
33.6%

Background: Digital decision-support tools such as triage systems and symptom checkers support millions of health-related decisions each year. Their quality and safety are commonly evaluated using textual patient cases, known as case vignettes. However, existing vignette sets written by medical experts cover only a limited spectrum of real-world patient presentations and lack population weights, which would allow extrapolating evaluation results to the underlying patient population. Objective: This study aims to develop a data-driven framework for automatically generating a human-manageable set of case vignettes from nationwide triage data that captures broad presentation diversity and links each vignette to a quantitative weight reflecting the number of underlying patient assessments. Methods: From 3.2 million triage assessments conducted over one year using structured triage software in the German medical on-call service (telephone triage and online self-triage) and at the joint contact points of the outpatient emergency care service and hospital emergency departments, we randomly sampled 50,000 cases. Triage questionnaires were converted into semantic embeddings using a German Sentence Transformer Model and grouped by agglomerative clustering. For clusters containing sufficient assessments, we generated one representative assessment using a two-phase simulated-annealing optimization. The optimization minimized the distance to the cluster centroid while maximizing the number of answered triage questions, aiming for high representativeness and information content. Each representative assessment was assigned the size of its source cluster as its sample-based weight. A similarity-based sensitivity analysis was performed to examine whether these weights were preserved in the full 1-year population. 
Finally, the question-answer pairs of the representative assessments were converted into structured textual case vignettes using controlled prompting of a large language model. Results: The cluster analysis yielded 514 included clusters covering 96.8% of the sampled 50,000 assessments. The generated representatives showed strong agreement with the majority treatment-urgency recommendation of their source cluster (Spearman's ρ=0.78, p<0.001) and contained on average 4.3 more answered triage questions than the original assessments within their clusters. When weighted by cluster size, the representatives approximated the sample distributions of treatment urgency, demographics, and symptoms, although some systematic deviations remained, most notably an overrepresentation of female cases (+13.5%), patients aged 14-49 years (+8.0%), and the urgency category "As soon as possible" (+6.6%). Of 121 recorded symptoms, 101 (83.5%) were covered by the representatives; the rest each occurred in <0.5% of the sample. In a sensitivity analysis, cluster-based vignette weights were strongly correlated with similarity-based population weights (Spearman's ρ=0.77, p<0.001), and 90.1% of assessments in the full 1-year population were matched to at least one vignette. Conclusions: We present a data-driven framework for deriving a manageable set of population-weighted case vignettes from nationwide triage data. The resulting vignettes captured broad presentation diversity, approximated key sample characteristics, and provided an explicit quantitative link to the number of underlying patient assessments. After medical expert review and refinement, the vignettes may support more population-aware evaluation and quality assurance of digital decision-support tools.

19
Noninvasive Hypokalemia Detection from Single-Lead AI-ECG: Development, Multicenter Validation, and Prospective Pilot Study in the Emergency Department

Tang, G.; Li, X.; Xiao, Y.; Wang, K.; Wu, M.; Wei, Z.; Yu, M.; Chen, X.; Hong, W.; Cheng, F.; Li, X.; Zhang, J.; Wu, X.; Hong, S.

2026-06-01 health informatics 10.64898/2026.05.23.26353774 medRxiv
Top 0.1%
33.3%

Hypokalemia is a common and potentially life-threatening electrolyte abnormality in emergency care, yet rapid noninvasive screening remains difficult in time-critical triage settings. We developed PocketED-K, a single-lead AI-ECG prescreening model initialized from ECGFounder, and evaluated it in retrospective multicenter cohorts and a prospective handheld pilot. Retrospective development and validation included 37,115 patients from MC-MED and MIMIC-ED, and the pilot enrolled 18 patients at Peking University First Hospital. Hypokalemia was defined as venous serum potassium < 3.5 mmol/L. PocketED-K achieved AUROCs of 0.8189 (95% CI 0.8172-0.8207) in internal testing, 0.8104 (95% CI 0.8092-0.8115) in temporal validation, and 0.7889 (95% CI 0.7692-0.8074) in independent external validation; external negative predictive value was 0.9911 (95% CI 0.9895-0.9925). Higher predicted risk was associated with ST-segment depression, T-wave flattening or inversion, and relative U-wave prominence. The prospective handheld pilot provided an initial signal of workflow feasibility in real-world acquisition. These findings support single-lead AI-ECG as a low-burden prescreening tool to prioritize potassium testing in emergency care.

20
Evidence-Graded Decision Authorization for Safe Clinical AI: A Constrained Reasoning Framework

Lin, C.; Lin, J.-Y.; Lin, Y.-S.

2026-05-22 health informatics 10.64898/2026.05.19.26353565 medRxiv
Top 0.1%
32.9%

Clinical AI systems have achieved strong predictive performance; however, prediction accuracy is not sufficient for clinical safety. Retrieval-augmented generation (RAG) improves factual accuracy, and general-purpose LLM guardrails constrain surface-level output safety, but these mechanisms do not govern the inferential gap between available clinical evidence and permissible clinical claims. We propose Evidence-Graded Decision Authorization (EGDA), a framework that separates evidence extraction, sufficiency assessment, and claim-level authorization through domain-specific rules. In a controlled experiment using 60 breast cancer decision-snapshot cases (1,260 system outputs across three arms evaluated by LLM-as-Judge with expert calibration), EGDA reduced the unjustified inference rate to 8.0% (vs. 48.7% for unconstrained LLM and 47.7% for RAG; risk difference vs. unconstrained -40.7%, 95% CI -46.9 to -34.0, p < 0.001), raised the appropriate refusal rate to 95.0% (vs. 56.9% and 56.9%; risk difference vs. unconstrained +38.1%, 95% CI +31.3 to +44.5, p < 0.001), and achieved the highest factual correctness at 96.4% (vs. 69.8% and 74.5%). An unexpected finding was that retrieval-augmented generation without an authorization gate failed to reduce unjustified inference relative to the unconstrained baseline (47.7% vs. 48.7%, p = 0.870) and produced no improvement in appropriate refusal (56.9% vs. 56.9%, p = 1.0), showing that information supply alone is not sufficient for inferential governance. We argue that domain-specific, evidence-graded reasoning governance should serve as a deployment reference standard for safety-critical clinical AI.
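
The risk differences quoted in the abstract are simple differences between arm-level rates, expressed in percentage points. The sketch below reproduces that arithmetic from the reported rates only; the helper name `risk_difference_pp` is hypothetical, and the confidence intervals and p-values are not recomputed because the abstract does not give the underlying counts per arm.

```python
def risk_difference_pp(rate_a, rate_b):
    """Difference between two rates given in percent, in percentage points."""
    return round(rate_a - rate_b, 1)


# Rates as reported in the abstract (percent).
unjustified = {"EGDA": 8.0, "unconstrained": 48.7, "RAG": 47.7}
refusal = {"EGDA": 95.0, "unconstrained": 56.9, "RAG": 56.9}

rd_unjustified = risk_difference_pp(unjustified["EGDA"], unjustified["unconstrained"])
rd_refusal = risk_difference_pp(refusal["EGDA"], refusal["unconstrained"])
rd_rag = risk_difference_pp(unjustified["RAG"], unjustified["unconstrained"])

print(rd_unjustified, rd_refusal, rd_rag)
```

The first two values match the reported risk differences versus the unconstrained arm, and the third shows how small the gap between RAG and the unconstrained baseline is.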